7 research outputs found
Supporting Account-based Queries for Archived Instagram Posts
Social media has become one of the primary modes of communication in recent times, with popular platforms such as Facebook, Twitter, and Instagram leading the way. Despite its popularity, Instagram has not received as much attention in academic research compared to Facebook and Twitter, and its significant role in contemporary society is often overlooked. Web archives are making efforts to preserve social media content despite the challenges posed by the dynamic nature of these sites. The goal of our research is to facilitate the easy discovery of archived copies, or mementos, of all posts belonging to a specific Instagram account in web archives. We proposed two approaches to support account-based queries for archived Instagram posts. The first approach uses existing Internet Archive technologies, employing WARC revisit records to incorporate Instagram usernames into the WARC-Target-URI field in the WARC file header. The second approach involves building an external index that maps Instagram user accounts to their posts. The user can query this index to retrieve all post URLs for a particular user, which they can then use to query web archives for each individual post. The implementation of both approaches was demonstrated, and their advantages and disadvantages were discussed. This research will enable web archivists to make informed decisions on which approach to adopt based on practicality and unique requirements for their archives.
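The second approach above can be sketched as a simple in-memory index mapping usernames to post URLs. The class name, example username, and URLs below are hypothetical illustrations; a real deployment would populate the index from crawl records.

```python
# Minimal sketch of an external index mapping Instagram usernames to
# post URLs (second approach). All names and URLs are hypothetical.
from collections import defaultdict

class PostIndex:
    def __init__(self):
        self._posts = defaultdict(set)  # username -> set of post URLs

    def add(self, username, post_url):
        self._posts[username].add(post_url)

    def lookup(self, username):
        """Return all known post URLs for an account, sorted."""
        return sorted(self._posts[username])

index = PostIndex()
index.add("example_user", "https://www.instagram.com/p/AAA111/")
index.add("example_user", "https://www.instagram.com/p/BBB222/")

# Each URL could then be used to query a web archive for mementos,
# e.g. via a Memento TimeMap endpoint such as
# https://web.archive.org/web/timemap/link/<post URL>
for url in index.lookup("example_user"):
    print(url)
```

A production index would likely live in a database rather than memory, but the lookup contract (account in, post URLs out) is the same.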
MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations
Metadata quality is crucial for digital objects to be discovered through digital library interfaces. Although DL systems have adopted Dublin Core to standardize metadata formats (e.g., ETD-MS v1.11), the metadata of digital objects may contain incomplete, inconsistent, and incorrect values [1]. Most existing frameworks to improve metadata quality rely on crowdsourced correction approaches, e.g., [2]. Such methods are usually slow and biased toward documents that are more discoverable by users. Artificial intelligence (AI) based methods can overcome this limitation by automatically detecting, correcting, and canonicalizing the metadata, processing document metadata quickly and without bias. This paper uses Electronic Theses and Dissertations (ETDs) metadata as a case study and proposes an AI-based framework to improve metadata quality.
An ETD represents the scholarly work of a student pursuing higher education, submitted in partial fulfillment of the requirements of a degree. ETDs are usually hosted by university libraries or ProQuest. Using web crawling techniques, we collected metadata and full text of 533,047 ETDs from 114 American universities. Upon inspecting the metadata of these ETDs, we noticed many ETD repositories are accompanied by incomplete, inconsistent, or incorrect metadata. We propose MetaEnhance, a framework that utilizes state-of-the-art AI methods to improve the quality of seven key metadata fields, including title, author, university, year, degree, advisor, and department. To evaluate MetaEnhance, we compiled a benchmark containing 500 ETDs by combining subsets sampled using different criteria. We evaluated MetaEnhance against this benchmark and found that the proposed methods achieved remarkable performance in detecting and correcting metadata errors.
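The kinds of errors targeted here (incomplete and incorrect field values) can be illustrated with a minimal rule-based detector. This is a sketch for illustration only, not MetaEnhance's actual methods; the field list follows the seven fields named above, and the plausibility range for years is an assumption.

```python
# Illustrative rule-based detection of incomplete/incorrect values in
# the seven key ETD metadata fields. Not MetaEnhance itself.
REQUIRED_FIELDS = ["title", "author", "university", "year",
                   "degree", "advisor", "department"]

def detect_errors(record):
    """Return a list of (field, error_type) tuples for one ETD record."""
    errors = []
    for fld in REQUIRED_FIELDS:
        value = (record.get(fld) or "").strip()
        if not value:
            errors.append((fld, "incomplete"))
    # Assumed plausibility check: year must be a 4-digit number in range.
    year = (record.get("year") or "").strip()
    if year and not (year.isdigit() and 1800 <= int(year) <= 2100):
        errors.append(("year", "incorrect"))
    return errors

record = {"title": "A Study", "author": "", "year": "19x7"}
print(detect_errors(record))
```

AI-based detection goes well beyond such rules (e.g., recognizing a misspelled university name), but the detect-then-correct pipeline shape is the same.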
Robots Still Outnumber Humans in Web Archives in 2019, But Less Than in 2012
To identify robots and human users in web archives, we conducted a study using the access logs from the Internet Archive’s (IA) Wayback Machine in 2012 (IA2012), 2015 (IA2015), and 2019 (IA2019), and the Portuguese Web Archive (PT) in 2019 (PT2019). We identified user sessions in the access logs and classified them as human or robot based on their browsing behavior. In 2013, AlNoamany et al. [1] studied the user access patterns using IA access logs from 2012. They established four web archive user access patterns: single-page access (Dip), access to the same page at multiple archive times (Dive), access to distinct web archive pages at about the same archive time (Slide), and access to a list of archived pages (TimeMaps) for a certain URL (Skim). They also determined that in the 2012 IA access logs, humans were outnumbered by robots by 10:1 in terms of sessions and 5:4 in terms of raw HTTP accesses. We extended their work by presenting a comparison of detected robots vs. humans and their access patterns and temporal preferences based on the two archives (IA vs. PT) and between three years of IA access logs (IA2012, IA2015, IA2019). The total number of robots detected in IA2012 (91% of requests) and IA2015 (88% of requests) is greater than in IA2019 (70% of requests). Robots account for 98% of requests in PT2019. We found that the robots are almost entirely limited to Dip and Skim access patterns in IA2012 and IA2015, but exhibit all the patterns and their combinations in IA2019. We also investigated the temporal preferences of the users and discovered that both humans and robots favor web pages that have been archived recently.
[1] AlNoamany, Y., Weigle, M.C., Nelson, M.L.: Access patterns for robots and humans in web archives. In: JCDL '13: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. pp. 339–348 (2013), https://dl.acm.org/doi/10.1145/2467696.2467722
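The four access patterns above can be illustrated with a simplified heuristic classifier over a single session. The rules below are an assumption for illustration and do not reproduce the paper's actual detection logic, which also considers combinations of patterns and browsing behavior.

```python
# Simplified illustration of the four web archive access patterns:
# Dip, Dive, Slide, and Skim. Not the paper's detection algorithm.
def classify_session(requests):
    """requests: list of (uri_r, archive_datetime, is_timemap) tuples."""
    if any(is_tm for _, _, is_tm in requests):
        return "Skim"   # accessing TimeMaps (lists of archived pages)
    uris = {u for u, _, _ in requests}
    times = {t for _, t, _ in requests}
    if len(requests) == 1:
        return "Dip"    # single-page access
    if len(uris) == 1:
        return "Dive"   # same page at multiple archive times
    if len(times) == 1:
        return "Slide"  # distinct pages at about the same archive time
    return "Mixed"      # combination of patterns

print(classify_session([("http://a.example/", "20120301", False)]))
```

Real sessions mix patterns, which is why the paper reports pattern combinations for IA2019 rather than a single label per session.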
Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations
Electronic Theses and Dissertations (ETDs) contain domain knowledge that can
be used for many digital library tasks, such as analyzing citation networks and
predicting research trends. Automatic metadata extraction is important to build
scalable digital library search engines. Most existing methods are designed for
born-digital documents, so they often fail to extract metadata from scanned
documents such as for ETDs. Traditional sequence tagging methods mainly rely on
text-based features. In this paper, we propose a conditional random field (CRF)
model that combines text-based and visual features. To verify the robustness of
our model, we extended an existing corpus and created a new ground truth corpus
consisting of 500 ETD cover pages with human validated metadata. Our
experiments show that CRF with visual features outperformed both a heuristic
and a CRF model with only text-based features. The proposed model achieved
81.3%-96% F1 measure on seven metadata fields. The data and source code are
publicly available on Google Drive (https://tinyurl.com/y8kxzwrp) and a GitHub
repository (https://github.com/lamps-lab/ETDMiner/tree/master/etd_crf),
respectively.
Comment: 7 pages, 4 figures, 1 table. Accepted by JCDL '21 as a short paper.
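Combining text-based and visual features for a CRF sequence tagger can be sketched as per-token feature dictionaries in the style of sklearn-crfsuite. The specific feature names, coordinate buckets, and values below are illustrative assumptions, not the paper's actual feature set.

```python
# Sketch of per-token features mixing text-based and visual (layout)
# signals for CRF-based metadata extraction from scanned cover pages.
# Feature names and bucket sizes are illustrative assumptions.
def token_features(token, font_size, x, y):
    return {
        # text-based features
        "lower": token.lower(),
        "is_title": token.istitle(),
        "is_digit": token.isdigit(),
        # visual features recovered from the scanned page layout
        "font_size": font_size,
        "x_bucket": x // 100,  # coarse horizontal position on the page
        "y_bucket": y // 100,  # coarse vertical position on the page
    }

feats = token_features("Dissertation", font_size=18, x=250, y=120)
print(feats["is_title"], feats["x_bucket"])
```

The intuition is that on a scanned cover page, position and font size carry field information (titles are large and centered, advisor names sit lower) that pure text features miss.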
MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries
Metadata quality is crucial for digital objects to be discovered through
digital library interfaces. However, due to various reasons, the metadata of
digital objects often exhibits incomplete, inconsistent, and incorrect values.
We investigate methods to automatically detect, correct, and canonicalize
scholarly metadata, using seven key fields of electronic theses and
dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that
utilizes state-of-the-art artificial intelligence methods to improve the
quality of these fields. To evaluate MetaEnhance, we compiled a metadata
quality evaluation benchmark containing 500 ETDs, by combining subsets sampled
using multiple criteria. We tested MetaEnhance on this benchmark and found that
the proposed methods achieved nearly perfect F1-scores in detecting errors and
F1-scores in correcting errors ranging from 0.85 to 1.00 for five of seven
fields.
Comment: 7 pages, 3 tables, and 1 figure. Accepted by 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL '23) as a short paper.
Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists
As web archives' holdings grow, archivists subdivide them into collections so
they are easier to understand and manage. In this work, we review the
collection structures of eight web archive platforms: Archive-It, Conifer,
the Croatian Web Archive (HAW), the Internet Archive's user account web
archives, Library of Congress (LC), PANDORA, Trove, and the UK Web Archive
(UKWA). We note a plethora of different approaches to web archive collection
structures. Some web archive collections support sub-collections and some
permit embargoes. Curatorial decisions may be attributed to a single
organization or many. Archived web pages are known by many names: mementos,
copies, captures, or snapshots. Some platforms restrict a memento to a single
collection and others allow mementos to cross collections. Knowledge of
collection structures has implications for many different applications and
users. Visitors will need to understand how to navigate collections. Future
archivists will need to understand what options are available for designing
collections. Platform designers need to know what possibilities exist. The
developers of tools that consume collections need to understand collection
structures so they can meet the needs of their users.
Comment: 5 figures, 16 pages, accepted for publication at TPDL 202
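The structural variations surveyed above can be illustrated with a hypothetical data model. This is an assumption for illustration, not any platform's actual schema: it shows sub-collections, embargoes, and a memento belonging to more than one collection, which some platforms allow and others forbid.

```python
# Hypothetical data model illustrating web archive collection
# structures: sub-collections, embargoes, and mementos that may
# appear in multiple collections. Not any platform's real schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Memento:
    uri_m: str  # URI of the archived page (the memento)

@dataclass
class Collection:
    name: str
    embargoed: bool = False
    subcollections: List["Collection"] = field(default_factory=list)
    mementos: List[Memento] = field(default_factory=list)

m = Memento("https://archive.example/web/20200101/http://a.example/")
news = Collection("News", mementos=[m])
elections = Collection("Elections", mementos=[m])  # same memento, two collections
news.subcollections.append(Collection("Local News", embargoed=True))
print(m in news.mementos and m in elections.mementos)
```

A platform that restricts a memento to a single collection would enforce a uniqueness constraint that this permissive model deliberately omits.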
The DSA Toolkit Shines Light Into Dark and Stormy Archives
Web archive collections are created with a particular purpose in mind. A curator selects seeds, or original resources, which are then captured by an archiving system and stored as archived web pages, or mementos. The systems that build web archive collections are often configured to revisit the same original resource multiple times. This is incredibly useful for understanding an unfolding news story or the evolution of an organization. Unfortunately, over time, some of these original resources can go off-topic and no longer suit the purpose for which the collection was originally created. They can go off-topic due to web site redesigns, changes in domain ownership, financial issues, hacking, technical problems, or because their content has moved on from the original topic. Even though they are off-topic, the archiving system will still capture them, thus it becomes imperative for anyone performing research on these collections to identify these off-topic mementos. Hence, we present the Off-Topic Memento Toolkit, which allows users to detect off-topic mementos within web archive collections. The mementos identified by this toolkit can then be separately removed from a collection or merely excluded from downstream analysis. The following similarity measures are available: byte count, word count, cosine similarity, Jaccard distance, Sørensen-Dice distance, Simhash using raw text content, Simhash using term frequency, and Latent Semantic Indexing via the gensim library. We document the implementation of each of these similarity measures. We possess a gold standard dataset generated by manual analysis, which contains both off-topic and on-topic mementos. Using this gold standard dataset, we establish a default threshold corresponding to the best F1 score for each measure. We also provide an overview of potential future directions that the toolkit may take.
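Two of the simpler set-based measures listed above can be sketched as follows. These are textbook definitions over word sets; the toolkit's own implementations may differ in tokenization and normalization, and a memento whose distance from the first capture exceeds the measure's threshold would be flagged as off-topic.

```python
# Sketch of two set-based distance measures over memento text:
# Jaccard distance and Sørensen-Dice distance. Textbook definitions;
# the toolkit may tokenize and normalize differently.
def jaccard_distance(a, b):
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    if not union:
        return 0.0
    return 1 - len(sa & sb) / len(union)

def dice_distance(a, b):
    sa, sb = set(a.split()), set(b.split())
    denom = len(sa) + len(sb)
    if denom == 0:
        return 0.0
    return 1 - 2 * len(sa & sb) / denom

on_topic = "election results county vote"
off_topic = "domain for sale contact owner"
print(jaccard_distance(on_topic, on_topic))  # 0.0
print(jaccard_distance(on_topic, off_topic)) # 1.0
```

Per-measure thresholds matter because the measures have different scales and sensitivities, which is why the toolkit calibrates each one against the gold standard dataset.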